
[nexus] Add instance_watcher concurrency limit #6527

Merged
merged 7 commits into main from eliza/bg-task-concurrency-limit
Sep 9, 2024

Conversation

hawkw
Member

@hawkw hawkw commented Sep 5, 2024

The instance_watcher background task follows a pattern where it
queries the database for instances, and then spawns a big pile of Tokio
tasks to concurrently perform health checks for those instances. As
suggested by @davepacheco in this comment, there should probably be
a limit on the number of concurrently running health checks to avoid
clobbering the sled-agents with a giant pile of HTTP requests.

This branch sets a global concurrency limit of 16 health checks (which
is fairly conservative, but we can turn it up later if it turns out to
be a bottleneck). The concurrency limit is implemented using the
database query's batch size. Previously, this code was written in a
slightly-weird dual-loop structure, which was intended specifically to
avoid the size of the database query batch acting as a concurrency
limit: we would read a page of sleds from CRDB, spawn a bunch of health
check tasks, and then read the next batch, waiting for the tasks to
complete only once all instance records had been read from the database.
Now, we can implement a concurrency limit by just...not doing that. We
now wait to read the next page of query results until we've run health
checks for every instance in the batch, limiting the number of
concurrently in-flight checks.

This has a nice advantage over the naïve approach of using a
tokio::sync::Semaphore or similar (which each health check task must
acquire before proceeding) as the concurrency limit: it also bounds the
amount of Nexus' memory used by the instance watcher. If we spawned all
the tasks immediately but made them wait to acquire a semaphore permit,
there would be a bunch of tasks sitting around in memory doing nothing
until the currently in-flight tasks completed. With the batch size as the
concurrency limit, we instead avoid spawning those tasks at all (and
avoid reading stuff from CRDB until we actually need it).
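
To make the batch-size-as-limit pattern concrete, here is a minimal,
self-contained sketch (not the actual Nexus code); `Instance`, `fetch_page`,
and `check_instance` are hypothetical stand-ins for the real query results,
the paginated CRDB query, and the sled-agent health check:

```rust
use tokio::task::JoinSet;

const MAX_CONCURRENT_CHECKS: usize = 16;

/// Hypothetical instance record, standing in for the real query results.
#[derive(Clone)]
struct Instance {
    id: u32,
}

/// Hypothetical paged fetch, standing in for the real paginated CRDB query:
/// returns at most `batch_size` records starting at `offset`, or an empty
/// `Vec` once everything has been read.
async fn fetch_page(offset: usize, batch_size: usize) -> Vec<Instance> {
    let total = 40; // pretend there are 40 instances overall
    (offset..total.min(offset + batch_size))
        .map(|id| Instance { id: id as u32 })
        .collect()
}

/// Hypothetical health check, standing in for the sled-agent HTTP request.
async fn check_instance(instance: Instance) {
    println!("checked instance {}", instance.id);
}

#[tokio::main]
async fn main() {
    let mut offset = 0;
    loop {
        // The page size doubles as the concurrency limit: at most
        // MAX_CONCURRENT_CHECKS records (and spawned tasks) exist at once.
        let page = fetch_page(offset, MAX_CONCURRENT_CHECKS).await;
        if page.is_empty() {
            break;
        }
        offset += page.len();

        let mut checks = JoinSet::new();
        for instance in page {
            checks.spawn(check_instance(instance));
        }
        // Don't fetch the next page until every check in this batch is done.
        while checks.join_next().await.is_some() {}
    }
}
```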

The `instance_watcher` background task follows a pattern where it
queries the database for instances, and then spawns a big pile of Tokio
tasks to concurrently perform health checks for those instances. As
suggested by @davepacheco in [this comment][1], there should probably be
a limit on the number of concurrently running health checks to avoid
clobbering the sled-agents with a giant pile of HTTP requests.

This branch sets a global concurrency limit of 16 health checks using a
`tokio::sync::Semaphore`. This is a pretty arbitrary, and fairly low,
limit, but we can bump it up later if it turns out to be a bottleneck.

[1]: #6519 (review)
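
For comparison, a minimal sketch of the semaphore-based limit (again with a
hypothetical `check_instance` standing in for the real health check) might
look like the following; note that every task is spawned up front and sits in
memory waiting for a permit, which is the downside called out above:

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;
use tokio::task::JoinSet;

const MAX_CONCURRENT_CHECKS: usize = 16;

/// Hypothetical health check, standing in for the sled-agent HTTP request.
async fn check_instance(id: u32) {
    println!("checked instance {id}");
}

#[tokio::main]
async fn main() {
    let semaphore = Arc::new(Semaphore::new(MAX_CONCURRENT_CHECKS));
    let mut tasks = JoinSet::new();

    // Spawn one task per instance immediately; each waits for a permit
    // before running its check, so at most MAX_CONCURRENT_CHECKS checks are
    // in flight at a time, but all the tasks exist in memory from the start.
    for id in 0..100u32 {
        let semaphore = Arc::clone(&semaphore);
        tasks.spawn(async move {
            let _permit = semaphore
                .acquire_owned()
                .await
                .expect("the semaphore is never closed");
            check_instance(id).await;
        });
    }
    while tasks.join_next().await.is_some() {}
}
```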
@hawkw hawkw requested a review from davepacheco September 5, 2024 18:17
hawkw added a commit that referenced this pull request Sep 6, 2024
The `instance_and_vmm_list_by_sled_agent` query in `nexus-db-queries`
returns a list of all instances, their projects, their active VMMs, and
the sleds those active VMMs are running on. This is used by the
`instance_watcher` background task to find VMMs to perform health checks
for. The query is paginated by sled IDs to limit the amount of data
fetched in one query, and so that the returned VMMs are grouped by sled,
allowing the `instance_watcher` task to reuse the same sled-agent client
connection pool for all VMMs on a sled.

Unfortunately, the current pagination is a bit wrong, as @davepacheco
kindly pointed out to me in [this comment][1]:

> This is paginated by just `sled_id`, but that doesn't quite work. The
> pagination marker must be unique for each row. I think you'll want to
> use `paginated_multicolumn` with another id here, maybe either the
> instance id or vmm id? As is, I'd expect this code to break when the
> list of instances returned for a particular sled spans a pagination
> boundary. (What I'd expect would happen there is: we process each page
> fine up through the first page of results containing that sled id in
> it, then we use that sled id as the marker, and the subsequent query
> would fetch results with sled ids `>=` the one given, and so you'd
> miss all the rows for that sled id that weren't on the first page of
> results for that sled id.) I was surprised to see that this is not
> documented on either `Paginator::found_batch()` nor `paginated()` :(

The only reason we haven't encountered this bug yet is that currently,
the batch size for the query is fairly high. However, PR #6527 reduces
the batch size substantially to serve as a limit on the number of
concurrently in flight health checks, meaning we'll see some VMMs not
being health checked any time a sled has more than 16 VMMs on it.

This commit fixes the query by changing it to use
`paginated_multicolumn` to paginate the results by both the sled's ID
*and* the VMM's ID. This is made possible by #6530, which changed
`paginated_multicolumn` to allow the query fragment that's paginated to
be a `FROM` clause that's a `JOIN`, in addition to a single table ---
because we must join the `sled` and `vmm` tables in order to get both
IDs. This change required restructuring the order in which the query is
constructed, due to Diesel Reasons, so I took the liberty to add some
comments while I was moving stuff around.

[1]: #6519 (review)
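
To see concretely why a non-unique pagination marker loses rows, here is a
small plain-Rust simulation (not the real `paginated_multicolumn` machinery);
`VmmRow` is a made-up stand-in, and the marker comparison is assumed to be
strictly greater-than:

```rust
/// Simplified stand-in for the real rows: each VMM row records the sled it
/// runs on (not unique per row) and its own VMM ID (unique per row).
#[derive(Clone, Copy, Debug, PartialEq)]
struct VmmRow {
    sled_id: u32,
    vmm_id: u32,
}

/// Paginate by sled_id alone: fetch rows whose sled_id is strictly greater
/// than the marker. Rows for the marker's sled that didn't fit in the
/// previous page are silently skipped.
fn next_page_by_sled(rows: &[VmmRow], marker: Option<u32>, batch: usize) -> Vec<VmmRow> {
    rows.iter()
        .filter(|r| marker.map_or(true, |m| r.sled_id > m))
        .take(batch)
        .copied()
        .collect()
}

/// Paginate by the compound key (sled_id, vmm_id), which is unique per row,
/// so no rows can be lost across a page boundary.
fn next_page_by_sled_and_vmm(
    rows: &[VmmRow],
    marker: Option<(u32, u32)>,
    batch: usize,
) -> Vec<VmmRow> {
    rows.iter()
        .filter(|r| marker.map_or(true, |m| (r.sled_id, r.vmm_id) > m))
        .take(batch)
        .copied()
        .collect()
}

fn main() {
    // One sled with three VMMs plus a second sled, paged two rows at a time.
    let rows = [
        VmmRow { sled_id: 1, vmm_id: 10 },
        VmmRow { sled_id: 1, vmm_id: 11 },
        VmmRow { sled_id: 1, vmm_id: 12 },
        VmmRow { sled_id: 2, vmm_id: 20 },
    ];

    // Single-column marker: page 2 jumps straight to sled 2, skipping vmm 12.
    let page1 = next_page_by_sled(&rows, None, 2);
    let marker = page1.last().map(|r| r.sled_id);
    let page2 = next_page_by_sled(&rows, marker, 2);
    assert_eq!(page2, vec![VmmRow { sled_id: 2, vmm_id: 20 }]);

    // Compound marker: page 2 resumes exactly where page 1 left off.
    let page1 = next_page_by_sled_and_vmm(&rows, None, 2);
    let marker = page1.last().map(|r| (r.sled_id, r.vmm_id));
    let page2 = next_page_by_sled_and_vmm(&rows, marker, 2);
    assert_eq!(
        page2,
        vec![
            VmmRow { sled_id: 1, vmm_id: 12 },
            VmmRow { sled_id: 2, vmm_id: 20 },
        ]
    );
}
```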
Collaborator

@davepacheco davepacheco left a comment

Nice. Basically just some comment suggestions here.

Two review threads on nexus/src/app/background/tasks/instance_watcher.rs (outdated, resolved)
```rust
        &opctx.log,
    )
};
let mut paginator = Paginator::new(MAX_CONCURRENT_CHECKS);
```
Collaborator

I feel like somewhere there ought to be a comment explaining that the batch size is used as a coarse way to limit concurrency. Maybe here?

I just noticed while looking at this that this isn't the same as targeting a concurrency level of MAX_CONCURRENT_CHECKS because we'll start 16, then wait until they finish, then start another 16, then wait, ... etc. So the average concurrency will be somewhat less. This is fine but I hadn't appreciated it when we talked about it.
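
(As a rough, made-up illustration of that effect: if 15 of the 16 checks in a batch took 1 second and one straggler took 4 seconds, the batch would hold 16 checks in flight for the first second and just 1 for the remaining 3 seconds, for a time-averaged concurrency of about (16·1 + 1·3)/4 ≈ 4.75 over that batch, well under the nominal limit, before even counting the gap while the next page is fetched.)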

Member Author

Yes, that's correct. That's an advantage of using the Tokio semaphore for the concurrency limit rather than the DB query...perhaps we should rework this to use a semaphore and make the query batch size smaller...
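
As a rough sketch of what such a rework could look like (hypothetical
stand-in code, not this PR's implementation): a stream-based limit via
`futures::StreamExt::for_each_concurrent` keeps concurrency near the target,
starting a new check as soon as any in-flight one finishes, rather than
draining to zero between batches. In the real task the instances would come
from the paginated query; a plain range stands in for that here.

```rust
use futures::stream::{self, StreamExt};

const MAX_CONCURRENT_CHECKS: usize = 16;

/// Hypothetical health check, standing in for the sled-agent HTTP request.
async fn check_instance(id: u32) {
    println!("checked instance {id}");
}

#[tokio::main]
async fn main() {
    // At most MAX_CONCURRENT_CHECKS checks run at once, and a new one starts
    // the moment any in-flight check completes, so concurrency stays near
    // the limit for as long as there is work left.
    stream::iter(0..100u32)
        .for_each_concurrent(MAX_CONCURRENT_CHECKS, |id| check_instance(id))
        .await;
}
```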

hawkw added a commit that referenced this pull request Sep 9, 2024
@hawkw hawkw merged commit ad8548a into main Sep 9, 2024
16 checks passed
@hawkw hawkw deleted the eliza/bg-task-concurrency-limit branch September 9, 2024 21:38